CarPrices is a data set that contains 80 observations of Cadillac cars. We look specifically at Price as a function of Mileage for the Deville model, and then at how strongly the Deville's Trim style influences the price.
Below is the model for simple linear regression, with Price as \(Y_i\) and Mileage as \(X_{1i}\):
\(Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i\)
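library(plotly)  # plot_ly(), add_trace(), layout()
library(dplyr)   # mutate(), case_when(), %>%
library(car)     # qqPlot()
# Assuming 'Deville' is your dataframe containing the data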
deville_lm <- lm(Price ~ Mileage, data = Deville)
b <- coef(deville_lm)
p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers",
color = ~Trim,
text = ~paste("Trim: ", Trim)) %>%
layout(title = "Price vs Mileage",
xaxis = list(title = "Mileage"),
yaxis = list(title = "Price"))
# Add regression line to the plot
p <- add_trace(p, x = Deville$Mileage, y = b[1] + b[2]*Deville$Mileage,
type = "scatter", mode = "lines", line = list(color = "green"))
# Print the plot
p
# Set up a 1x3 grid for the three diagnostic plots
par(mfrow=c(1,3))
plot(deville_lm, which=1)
qqPlot(deville_lm$residuals, id=FALSE)
plot(deville_lm$residuals)
summary(deville_lm)
##
## Call:
## lm(formula = Price ~ Mileage, data = Deville)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4296 -2986 1027 1881 3870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41106.3054 1295.4696 31.731 < 2e-16 ***
## Mileage -0.2461 0.0607 -4.055 0.000362 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2667 on 28 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3475
## F-statistic: 16.44 on 1 and 28 DF, p-value: 0.0003624
Adjusted R-squared: 0.3475. Mileage alone accounts for only about a third of the variation in Price.
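As a quick check on where the adjusted value comes from, the sketch below recomputes it from the printed summary using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adj_r2` is made up for illustration, and n = 30 is inferred from the 28 residual degrees of freedom with one predictor.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (not counting the intercept).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read off the summary above: R-squared = 0.37 and 28 residual df
# with one predictor, so n = 28 + 1 + 1 = 30 (an inference, not printed).
mileage_only <- adj_r2(r2 = 0.37, n = 30, p = 1)
round(mileage_only, 4)
```

The result agrees with the 0.3475 reported by `summary()`. With only one predictor, the penalty is small and the adjusted value stays close to the multiple R-squared.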
Below is the equation for a two-lines linear regression model, where \(X_{1i}\) is Mileage and \(X_{2i}\) is an indicator for Trim:
\(\underbrace{Y_i}_\text{Price} = \underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}}_{E(Y_i)} + \epsilon_i\)
\[ X_{2i} = \begin{cases} 1 & \text{if Trim = DHS Sedan 4D or DTS Sedan 4D} \\ 0 & \text{if Trim = Sedan 4D} \end{cases} \]
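Substituting the two values of \(X_{2i}\) into the model shows why it is called a two-lines model: both groups share the slope \(\beta_1\), and \(\beta_2\) shifts only the intercept. This is a restatement of the equation above, not an additional assumption.

```latex
\[
E(Y_i) =
\begin{cases}
\beta_0 + \beta_1 X_{1i} & \text{if } X_{2i} = 0 \text{ (Sedan 4D)} \\
(\beta_0 + \beta_2) + \beta_1 X_{1i} & \text{if } X_{2i} = 1 \text{ (DHS Sedan 4D or DTS Sedan 4D)}
\end{cases}
\]
```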
# Create the scatter plot using plotly, coloring points based on the "Trim" variable
p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers",
color = ~Trim,
text = ~paste("Trim: ", Trim)) %>%
layout(title = "Price vs Mileage",
xaxis = list(title = "Mileage"),
yaxis = list(title = "Price"))
b <- coef(deville_lm)
p <- add_trace(p, x = Deville$Mileage, y = b[1] + b[2]*Deville$Mileage,
type = "scatter", mode = "lines", line = list(color = "skyblue"))
p
# Find a variable for the Cadillac Deville that, when added, makes for a better fit.
# Try 2 lines and see what splits the values. Look at the R-squared values.
# Start with just Mileage.
# When you add the variable, you will see the R-squared jump.
Split based on Trim. Copy the equation from the Statistics Notebook, except for the last (change-in-slope) term. Do 3 graphs for the new lines with the regression model.
Deville <- Deville %>%
mutate(
Trim_Case = case_when(
Trim %in% c("DHS Sedan 4D", "DTS Sedan 4D") ~ 1,
Trim == "Sedan 4D" ~ 0
)
)
# Fit the linear model with the Trim indicator added
lm_trim <- lm(Price ~ Mileage + Trim_Case, data = Deville)
# Obtain fitted values (not needed for the plot below)
# fitted_values <- predict(lm_trim)
# Get coefficients
bd <- coef(lm_trim)
# Create scatter plot
p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers",
color = ~Trim,
text = ~paste("Trim: ", Trim)) %>%
layout(title = "Price vs Mileage",
xaxis = list(title = "Mileage"),
yaxis = list(title = "Price"))
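# Add regression lines: same slope, intercepts differ by the Trim_Case coefficient
# Baseline line (Trim_Case = 0, Sedan 4D)
p <- add_trace(p, x = Deville$Mileage, y = bd[1] + bd[2]*Deville$Mileage,
type = "scatter", mode = "lines", line = list(color = "lightblue"))
# Shifted line (Trim_Case = 1, DHS Sedan 4D or DTS Sedan 4D)
p <- add_trace(p, x = Deville$Mileage, y = (bd[1] + bd[3]) + bd[2]*Deville$Mileage,
type = "scatter", mode = "lines", line = list(color = "pink"))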
p # Print the plot
par(mfrow=c(1,3))
plot(lm_trim, which=1)
qqPlot(lm_trim, id=FALSE)
plot(lm_trim$residuals)
summary(lm_trim)
##
## Call:
## lm(formula = Price ~ Mileage + Trim_Case, data = Deville)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1612.02 -220.33 59.19 380.80 1771.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.871e+04 3.895e+02 99.38 < 2e-16 ***
## Mileage -3.054e-01 1.747e-02 -17.48 2.99e-16 ***
## Trim_Case 5.347e+03 2.973e+02 17.99 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 753.8 on 27 degrees of freedom
## Multiple R-squared: 0.9515, Adjusted R-squared: 0.9479
## F-statistic: 264.7 on 2 and 27 DF, p-value: < 2.2e-16
Multiple R-squared: 0.9515 (adjusted: 0.9479)
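To make the coefficients concrete, here is a minimal sketch that plugs the rounded estimates from the `summary(lm_trim)` output into the fitted equation. The helper `predict_price` and the 20,000-mile input are illustrative choices, not part of the original analysis; with the fitted model object available, `predict(lm_trim, newdata = ...)` would do the same job without rounding.

```r
# Coefficients copied (rounded) from the summary(lm_trim) output above.
b0 <- 38710      # intercept: expected price for Sedan 4D at 0 miles
b1 <- -0.3054    # change in expected price per additional mile
b2 <- 5347       # price shift for DHS/DTS trims (Trim_Case = 1)

# Hypothetical helper: expected price for a given mileage and trim indicator.
predict_price <- function(mileage, trim_case) {
  b0 + b1 * mileage + b2 * trim_case
}

# 20,000 miles is an arbitrary illustrative mileage.
base_trim  <- predict_price(20000, trim_case = 0)
upper_trim <- predict_price(20000, trim_case = 1)
c(base = base_trim, upper = upper_trim, gap = upper_trim - base_trim)
```

At 20,000 miles this gives roughly 32,602 for Sedan 4D and 37,949 for the DHS/DTS trims. Because the two lines are parallel, the gap equals the Trim_Case coefficient (about 5,347) at every mileage.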